Finding an initial noise vector that produces an input image when fed into the diffusion process (known as inversion) is an important problem in denoising diffusion models (DDMs), with applications for real image editing. The state-of-the-art approach for real image editing with inversion uses denoising diffusion implicit models (DDIMs) to deterministically noise the image to the intermediate state along the path that the denoising would follow given the original conditioning. However, DDIM inversion for real images is unstable as it relies on local linearization assumptions, which result in the propagation of errors, leading to incorrect image reconstruction and loss of content. To alleviate these problems, we propose Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers. EDICT enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors which are used to invert each other in an alternating fashion. Using Stable Diffusion, a state-of-the-art latent diffusion model, we demonstrate that EDICT successfully reconstructs real images with high fidelity. On complex image datasets like MS-COCO, EDICT reconstruction significantly outperforms DDIM, improving the mean square error of reconstruction by a factor of two. Using noise vectors inverted from real images, EDICT enables a wide range of image edits--from local and global semantic edits to image stylization--while maintaining fidelity to the original image structure. EDICT requires no model training/finetuning, prompt tuning, or extra data and can be combined with any pretrained DDM. Code is available at https://github.com/salesforce/EDICT.
translated by 谷歌翻译
基于注意的蛋白质序列训练的基于注意力的模型在分类和与人工智能驱动的蛋白质设计相关的分类和生成任务方面取得了令人难以置信的成功。但是,我们对非常大规模的模型和数据在有效的蛋白质模型开发中发挥作用。我们介绍了一套名为progen2的蛋白质语言模型的套件,该模型最高为6.4b参数,并在从基因组,宏基因组和免疫曲目数据库中绘制的不同序列数据集上进行了培训。 GEECEN2模型在捕获观察到的进化序列的分布,生成新型的可行序列并预测蛋白质适应性的情况下显示出最先进的性能,而无需额外的芬特。随着蛋白质序列的大型大小和原始数量继续变得更加广泛,我们的结果表明,越来越多的重点需要放在提供给蛋白质序列模型的数据分布上。我们在https://github.com/salesforce/progen上发布了PECEN2模型和代码。
translated by 谷歌翻译
我们提出了Clip-Lite,一种通过与文本注释的特征对齐方式进行视觉表示学习的信息有效方法。与先前提出的剪辑模型相比,剪辑液在优化其对比学学习目标期间只需要一个负图像文本样本对。我们通过利用信息有效的较低限制来实现这一点,以最大化两个输入模态之间的相互信息。这允许剪辑Lite培训,在获得比夹子的更好的性能的同时具有显着减少的数据和批量尺寸。我们通过在Coco-Tablions数据集上预先绘制来评估剪贴画并对其他数据集进行测试传输。 Clip-Lite在Pascal VOC分类上获得+ 15.4%的映射绝对增益,并在ImageNet上获得A + 22.1%的前1个精度增益,同时与其他更复杂,文本监督模型相当或优越。 Clip-Lite还优于剪辑图像和文本检索,零拍分类和视觉接地。最后,通过在表示学习期间执行显式图像文本对齐,我们显示Clip-Lite可以利用语言语义来鼓励可以在下游任务中使用的无偏见的视觉表示。
translated by 谷歌翻译
由于存在对象的自然时间转换,视频是一种具有自我监督学习(SSL)的丰富来源。然而,目前的方法通常是随机采样用于学习的视频剪辑,这导致监督信号差。在这项工作中,我们提出了预先使用无监督跟踪信号的SSL框架,用于选择包含相同对象的剪辑,这有助于更好地利用对象的时间变换。预先使用跟踪信号在空间上限制帧区域以学习并通过在Grad-CAM注意图上提供监督来定位模型以定位有意义的物体。为了评估我们的方法,我们在VGG-Sound和Kinetics-400数据集上培训势头对比(MOCO)编码器,预先使用预先。使用Previts的培训优于Moco在图像识别和视频分类下游任务中独自学习的表示,从而获得了行动分类的最先进的性能。预先帮助学习更强大的功能表示,以便在背景和视频数据集上进行背景和上下文更改。从大规模未婚视频中学习具有预算的大规模未能视频可能会导致更准确和强大的视觉功能表示。
translated by 谷歌翻译
We study algorithms for detecting and including glass objects in an optimization-based Simultaneous Localization and Mapping (SLAM) algorithm in this work. When LiDAR data is the primary exteroceptive sensory input, glass objects are not correctly registered. This occurs as the incident light primarily passes through the glass objects or reflects away from the source, resulting in inaccurate range measurements for glass surfaces. Consequently, the localization and mapping performance is impacted, thereby rendering navigation in such environments unreliable. Optimization-based SLAM solutions, which are also referred to as Graph SLAM, are widely regarded as state of the art. In this paper, we utilize a simple and computationally inexpensive glass detection scheme for detecting glass objects and present the methodology to incorporate the identified objects into the occupancy grid maintained by such an algorithm (Google Cartographer). We develop both local (submap level) and global algorithms for achieving the objective mentioned above and compare the maps produced by our method with those produced by an existing algorithm that utilizes particle filter based SLAM.
translated by 谷歌翻译
Language models are widely deployed to provide automatic text completion services in user products. However, recent research has revealed that language models (especially large ones) bear considerable risk of memorizing private training data, which is then vulnerable to leakage and extraction by adversaries. In this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text, while varying other factors such as model size and adversarial conditions. We test both "heuristic" mitigations (those without formal privacy guarantees) and Differentially Private training, which provides provable levels of privacy at the cost of some model performance. Our experiments show that (with the exception of L2 regularization), heuristic mitigations are largely ineffective in preventing memorization in our test suite, possibly because they make too strong of assumptions about the characteristics that define "sensitive" or "private" text. In contrast, Differential Privacy reliably prevents memorization in our experiments, despite its computational and model-performance costs.
translated by 谷歌翻译
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io
translated by 谷歌翻译
Reflections on glossy objects contain valuable and hidden information about the surrounding environment. By converting these objects into cameras, we can unlock exciting applications, including imaging beyond the camera's field-of-view and from seemingly impossible vantage points, e.g. from reflections on the human eye. However, this task is challenging because reflections depend jointly on object geometry, material properties, the 3D environment, and the observer viewing direction. Our approach converts glossy objects with unknown geometry into radiance-field cameras to image the world from the object's perspective. Our key insight is to convert the object surface into a virtual sensor that captures cast reflections as a 2D projection of the 5D environment radiance field visible to the object. We show that recovering the environment radiance fields enables depth and radiance estimation from the object to its surroundings in addition to beyond field-of-view novel-view synthesis, i.e. rendering of novel views that are only directly-visible to the glossy object present in the scene, but not the observer. Moreover, using the radiance field we can image around occluders caused by close-by objects in the scene. Our method is trained end-to-end on multi-view images of the object and jointly estimates object geometry, diffuse radiance, and the 5D environment radiance field.
translated by 谷歌翻译
Deep Reinforcement Learning (DRL) has the potential to be used for synthesizing feedback controllers (agents) for various complex systems with unknown dynamics. These systems are expected to satisfy diverse safety and liveness properties best captured using temporal logic. In RL, the reward function plays a crucial role in specifying the desired behaviour of these agents. However, the problem of designing the reward function for an RL agent to satisfy complex temporal logic specifications has received limited attention in the literature. To address this, we provide a systematic way of generating rewards in real-time by using the quantitative semantics of Signal Temporal Logic (STL), a widely used temporal logic to specify the behaviour of cyber-physical systems. We propose a new quantitative semantics for STL having several desirable properties, making it suitable for reward generation. We evaluate our STL-based reinforcement learning mechanism on several complex continuous control benchmarks and compare our STL semantics with those available in the literature in terms of their efficacy in synthesizing the controller agent. Experimental results establish our new semantics to be the most suitable for synthesizing feedback controllers for complex continuous dynamical systems through reinforcement learning.
translated by 谷歌翻译
Often questions provided to open-domain question answering systems are ambiguous. Traditional QA systems that provide a single answer are incapable of answering ambiguous questions since the question may be interpreted in several ways and may have multiple distinct answers. In this paper, we address multi-answer retrieval which entails retrieving passages that can capture majority of the diverse answers to the question. We propose a re-ranking based approach using Determinantal point processes utilizing BERT as kernels. Our method jointly considers query-passage relevance and passage-passage correlation to retrieve passages that are both query-relevant and diverse. Results demonstrate that our re-ranking technique outperforms state-of-the-art method on the AmbigQA dataset.
translated by 谷歌翻译